fix(exp): Add a real agent example #3507
Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough
Adds a travel-agent experiment module that runs travel queries, captures OpenTelemetry spans to extract trajectory prompts/completions and tool calls, provides utilities to extract trajectories and run tasks with an in-memory span capture, and updates an evaluator API to accept threshold and conditions parameters. A minimal span-extraction sketch follows the sequence diagram below.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Orchestrator as run_travel_agent_experiment()
    participant Task as travel_agent_task(row)
    participant Agent as travel_agent_example.run_travel_query
    participant Tracer as InMemory OTEL Exporter
    participant Extract as extract_trajectory_from_spans()
    participant Evaluator as EvaluatorMadeByTraceloop
    Orchestrator->>Orchestrator: configure evaluators (threshold/conditions)
    Orchestrator->>Orchestrator: load travel-queries dataset
    loop per query row
        Orchestrator->>Task: invoke travel_agent_task(row)
        Task->>Tracer: reset/init in-memory exporter
        Task->>Agent: run_travel_query(row.query)
        Agent->>Tracer: emit spans (gen_ai.prompt.*, gen_ai.completion.*, gen_ai.tool.*)
        Tracer-->>Task: finished spans captured
        Task->>Extract: extract_trajectory_from_spans(spans)
        Extract-->>Task: return trajectories (prompts, completions, tool calls)
        Task-->>Orchestrator: return result with trajectory JSON and final_completion
    end
    Orchestrator->>Evaluator: run evaluators on collected results
    Evaluator-->>Orchestrator: evaluation summaries
    Orchestrator->>Orchestrator: print results summary
```
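To ground the walkthrough, here is a minimal sketch of the span-to-trajectory extraction step described above. This is illustrative rather than the PR's exact code: it assumes each captured span exposes a flat `attributes` mapping with `gen_ai.*` semantic-convention keys, and the real module's tool-call handling may differ.

```python
# Illustrative sketch: collect trajectory data from finished spans captured
# by an in-memory exporter. Assumes gen_ai.* attribute keys as in the diagram.
def extract_trajectory(spans):
    trajectory_prompts = {}
    trajectory_completions = {}
    tool_calls = []
    for span in spans:
        attributes = span.attributes or {}
        for key, value in attributes.items():
            if key.startswith("gen_ai.prompt."):
                trajectory_prompts[key] = value
            elif key.startswith("gen_ai.completion."):
                trajectory_completions[key] = value
            elif key.startswith("gen_ai.tool."):  # tool-call handling is assumed here
                tool_calls.append(value)
    return {
        "trajectory_prompts": trajectory_prompts,
        "trajectory_completions": trajectory_completions,
        "tool_calls": tool_calls,
    }
```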
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 inconclusive)
✅ Passed checks (4 passed)
✨ Finishing touches: 🧪 Generate unit tests (beta)
…nk/travel_agent_case
Caution
Changes requested ❌
Reviewed everything up to f19d82c in 2 minutes and 57 seconds. Click for details.
- Reviewed 355 lines of code in 2 files
- Skipped 0 files when reviewing
- Skipped posting 3 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:33
- Draft comment: Verify that 'spec' and 'spec.loader' are not None before using them to load the module. This prevents silent failures if the file is missing or the spec fails to load.
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 20% vs. threshold = 50%). The comment identifies a legitimate potential issue - if the file is missing or spec fails to load, the code will raise an AttributeError when trying to access spec.loader or call module_from_spec(None). However, the comment violates the rules in two ways: 1) It starts with "Verify that..." which is explicitly called out as a pattern to avoid, and 2) It's somewhat speculative about "silent failures" - if spec is None, this would actually raise an exception, not fail silently. The comment is asking the author to check/verify rather than stating a clear code change. This is a new file being added, so this is about changes made in the diff. While there is a real potential issue here (missing null checks), the phrasing "Verify that..." makes this an ask for the author to check rather than a clear actionable suggestion. Also, the failure wouldn't actually be "silent" - it would raise an AttributeError. The comment could be valid if rephrased more directly as "Add null checks for spec and spec.loader". The comment does identify a real code quality issue where proper error handling is missing. However, the phrasing violates the stated rules about not asking authors to "verify" or "ensure" things. The rules explicitly state "Do NOT ask the PR author to confirm their intention, to explain, to double-check things, to ensure the behavior is intended" and give "Verify that..." as an example of what not to do. This comment should be deleted because it uses the "Verify that..." pattern which is explicitly prohibited in the rules. While it identifies a legitimate issue, the phrasing asks the author to check/verify rather than stating a clear code change requirement.
2. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:293
- Draft comment: Returned result keys ('trajectory_prompts'/'trajectory_completions') differ from what the sample results block expects ('prompts'/'completions'). Ensure consistent naming or proper field mapping.
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 35% vs. threshold = 50%). The comment identifies a legitimate inconsistency between what the function returns and what the results printing code expects. However, I need to consider: 1) This is in a print/logging section that displays sample results, not core functionality. 2) The .get() method with a default value means this won't crash - it will just show incorrect/empty values. 3) The comment says "Ensure consistent naming or proper field mapping" which is somewhat vague and doesn't provide a clear action. 4) According to the rules, I should not keep comments that ask the author to "ensure" things. 5) The actual evaluators might work fine with 'trajectory_prompts'/'trajectory_completions' - the issue is only in the display code. 6) This is a new file being added, so this is part of the changes. The comment does identify a real bug where the results printing will show incorrect values. This isn't just informational - it's pointing out that lines 299-300 will not work as intended because they're looking for keys that don't exist in the returned dict. The comment starts with "Ensure" which typically violates the rules, but it's also identifying a concrete mismatch that would lead to incorrect output. While the comment uses "Ensure" language which is typically discouraged, it does identify a concrete bug where the print statements will fail to display the intended information. The .get() calls will return empty strings and the split operations will produce incorrect counts. This is actionable - the author should either change the return keys or the print statement keys to match. It's not speculative or asking for confirmation. This comment identifies a real inconsistency that will cause the results display to show incorrect information. While it uses "Ensure" language, it points to a concrete bug in the code. However, the phrasing could be more direct about what needs to be fixed. Given the rules emphasize removing comments unless there's STRONG EVIDENCE they're correct, and this is a relatively minor display issue (not core functionality), I'll lean toward removing it.
3. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:136
- Draft comment: Rather than directly deleting the 'TracerWrapper.instance', consider implementing a proper reset API for the tracer to avoid potential side effects.
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 20% vs. threshold = 50%). This is a code quality suggestion about using a proper API instead of directly manipulating internal state. However, there are several issues: 1) This is a new file being added, so the author clearly wrote this intentionally as a workaround, 2) The comment doesn't provide evidence that a reset API exists in TracerWrapper, 3) Without seeing the TracerWrapper implementation, we can't know if this approach is problematic or if there's a better alternative, 4) The comment is speculative ("potential side effects") without demonstrating actual issues, 5) The local variable assignment on line 137 suggests the author may have had a reason for this approach (possibly to keep a reference). The rules state we should only keep comments with STRONG EVIDENCE they are correct, and speculative comments should be removed. The comment might be valid if TracerWrapper does have a proper reset API that should be used instead. The author might not be aware of it. Also, directly deleting singleton instances can indeed cause issues in some cases. However, without evidence that such an API exists or that this approach causes problems, this is speculative. While the concern about proper APIs is generally valid, this comment lacks strong evidence. It doesn't show that a reset API exists, doesn't demonstrate actual side effects, and is speculative ("potential side effects"). The code appears intentional given the comment on line 135 explaining the purpose. Without access to TracerWrapper's implementation or evidence of problems, this is a speculative suggestion that should be removed per the rules. This comment should be deleted. It's a speculative suggestion without strong evidence of an actual problem or proof that a better API exists. The rules require strong evidence for keeping comments, and this doesn't meet that bar.
Workflow ID: wflow_kMz1aMaasqEpWUhZ
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 4
🧹 Nitpick comments (2)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (2)
31-38: Guard dynamic travel agent loading for clearer errors
`_load_travel_agent_module` assumes the `agents_path` exists and that `spec`/`spec.loader` is non-None. If the file is missing or the path changes, this will fail with a somewhat opaque `AttributeError`. You could make failures clearer:

```diff
 def _load_travel_agent_module():
     """Dynamically load the travel_agent_example module."""
     agents_path = Path(__file__).parent.parent.parent / "agents" / "travel_agent_example.py"
-    spec = importlib.util.spec_from_file_location("travel_agent_example", agents_path)
-    module = importlib.util.module_from_spec(spec)
-    sys.modules["travel_agent_example"] = module
-    spec.loader.exec_module(module)
-    return module
+    if not agents_path.exists():
+        raise FileNotFoundError(f"travel_agent_example.py not found at {agents_path}")
+
+    spec = importlib.util.spec_from_file_location("travel_agent_example", agents_path)
+    if spec is None or spec.loader is None:
+        raise ImportError(f"Could not load spec for travel_agent_example from {agents_path}")
+
+    module = importlib.util.module_from_spec(spec)
+    sys.modules["travel_agent_example"] = module
+    spec.loader.exec_module(module)
+    return module
```

This keeps the example robust while failing fast with a helpful message if the file isn't where it's expected.
185-219: Normalize trajectory JSON handling and include `tool_calls` in the task result

The current trajectory handling has a few rough edges:
- For empty trajectories, you set `trajectory_prompts = json.dumps([])` / `trajectory_completions = json.dumps([])` and then later call `json.dumps` again, which yields `"[]"` (a JSON string of a JSON string) instead of `[]`.
- The "Prompt attributes captured" / "Completion attributes captured" counts are based on mixed dict/string types, so counts become misleading when empty.
- The returned task result doesn't expose `tool_calls`, even though they're collected and later referenced in the sample results.

You can address all three by:

```diff
-    # trajectory_prompts and trajectory_completions are dicts with llm.prompts/completions.* keys
-    # If empty, use JSON string fallback to avoid validation errors
-    trajectory_prompts = trajectory_data["trajectory_prompts"]
-    trajectory_completions = trajectory_data["trajectory_completions"]
-
-    # Convert to JSON strings if empty (evaluators expect string when no data)
-    if not trajectory_prompts:
-        trajectory_prompts = json.dumps([])
-    if not trajectory_completions:
-        trajectory_completions = json.dumps([])
-
-    print("📊 Trajectory Summary:")
-    print(f"   - Prompt attributes captured: {len(trajectory_prompts)}")
-    print(f"   - Completion attributes captured: {len(trajectory_completions)}")
+    # trajectory_prompts and trajectory_completions are dicts with llm.prompts/completions.* keys
+    trajectory_prompts = trajectory_data["trajectory_prompts"]
+    trajectory_completions = trajectory_data["trajectory_completions"]
+
+    prompt_attr_count = len(trajectory_prompts)
+    completion_attr_count = len(trajectory_completions)
+
+    # Convert to JSON strings for evaluator input; use [] when empty
+    json_trajectory_prompts = json.dumps(trajectory_prompts or [])
+    json_trajectory_completions = json.dumps(trajectory_completions or [])
+
+    print("📊 Trajectory Summary:")
+    print(f"   - Prompt attributes captured: {prompt_attr_count}")
+    print(f"   - Completion attributes captured: {completion_attr_count}")
@@
-    # - "context" maps to "reference"
-    # - "trajectory_prompts" and "trajectory_completions" as dicts with llm.prompts/completions.* keys
-
-    json_trajectory_prompts = json.dumps(trajectory_prompts)
-    json_trajectory_completions = json.dumps(trajectory_completions)
-    # prompt_list = str(trajectory_prompts)
-    # completion_list = str(trajectory_completions)
-
-    return {
-        "prompt": query,
-        "answer": final_completion if final_completion else query,
-        "context": f"The agent should create a complete travel itinerary for: {query}",
-        "trajectory_prompts": json_trajectory_prompts,
-        "trajectory_completions": json_trajectory_completions,
-    }
+    # - "context" maps to "reference"
+    # - "trajectory_prompts" and "trajectory_completions" as JSON strings with llm.prompts/completions.* keys
+
+    return {
+        "prompt": query,
+        "answer": final_completion if final_completion else query,
+        "context": f"The agent should create a complete travel itinerary for: {query}",
+        "trajectory_prompts": json_trajectory_prompts,
+        "trajectory_completions": json_trajectory_completions,
+        "tool_calls": trajectory_data["tool_calls"],
+    }
```

This keeps evaluator inputs JSON-encoded (with a clean `[]` for the empty case), fixes the counts, and surfaces `tool_calls` for downstream inspection.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1 hunks)
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
- packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
- packages/traceloop-sdk/traceloop/sdk/evaluator/config.py (1): EvaluatorDetails (5-23)
🪛 Flake8 (7.3.0)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
[error] 137-137: local variable '_trace_wrapper_instance' is assigned to but never used
(F841)
🪛 Ruff (0.14.7)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
213-219: Consider moving this statement to an else block
(TRY300)
221-222: Remove exception handler; error is immediately re-raised
(TRY203)
222-222: Use raise without specifying exception name
Remove exception name
(TRY201)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py
571-571: Do not use mutable data structures for argument defaults
Replace with None; initialize within function
(B006)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Lint
🔇 Additional comments (2)
packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (1)
241-255: Docstring update for `word_count_ratio` looks correct

Including `denominator_text` in the "Required task output fields" aligns the docstring with the actual `required_input_fields`. No changes needed here.

packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1)
292-300: Fix sample results printing to match actual task return structure

The code at lines 298-300 accesses keys that don't exist in `task_result`. The `travel_agent_task` function returns `prompt`, `answer`, `context`, `trajectory_prompts`, and `trajectory_completions`, but not `tool_calls`, `prompts`, or `completions`. Update to:

```python
print(f"   - Tools used: {len(task_result.get('tool_calls', []))}")
prompts_json = task_result.get("trajectory_prompts", "[]")
completions_json = task_result.get("trajectory_completions", "[]")
try:
    prompts_obj = json.loads(prompts_json)
    completions_obj = json.loads(completions_json)
except json.JSONDecodeError:
    prompts_obj = {}
    completions_obj = {}
print(f"   - Prompt attributes captured: {len(prompts_obj)}")
print(f"   - Completion attributes captured: {len(completions_obj)}")
```

Note: If `tool_calls` should be displayed here, it must first be added to the return dict of `travel_agent_task` (currently it's extracted in `extract_trajectory_from_spans` but not included in the final return).
Important
Looks good to me! 👍
Reviewed 5dab9df in 1 minute and 38 seconds. Click for details.
- Reviewed 71 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 5 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:27
- Draft comment: Removed redundant comment before the module loading function. Consider adding a more descriptive module-level docstring if needed.
- Reason this comment was not posted:
Comment looked like it was already resolved.
2. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:135
- Draft comment: Removed the unused binding to _trace_wrapper_instance when clearing the singleton. This cleanup eliminates clutter without affecting functionality.
- Reason this comment was not posted: Comment did not seem useful (confidence useful = 0% <= threshold 50%). This comment is purely informative, explaining a change that was made without suggesting any action or asking for clarification. It doesn't provide any actionable feedback or raise any concerns about the code.
3. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:196
- Draft comment: Refactored assignment of 'tools_called' improves readability of the print statement.
- Reason this comment was not posted: Comment did not seem useful (confidence useful = 0% <= threshold 50%). This comment is purely informative and does not provide any actionable feedback or suggestions for improvement. It simply states that the refactoring improves readability, which is not necessary for the PR author to know.
4. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:208
- Draft comment: Updated comment to reflect that 'trajectory_prompts' and 'trajectory_completions' are returned as JSON strings, aligning with the fallback behavior when they are empty. Verify that downstream evaluators expect a JSON string rather than a dict.
- Reason this comment was not posted: Comment did not seem useful (confidence useful = 0% <= threshold 50%). The comment is asking the PR author to verify that downstream evaluators expect a JSON string rather than a dict. This falls under the category of asking the author to confirm or verify something, which is against the rules. Therefore, this comment should be removed.
5. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:270
- Draft comment: Experiment slug updated from 'travel-agent-exp-2' to 'travel-agent-exp-3'. Ensure that related documentation and platform configurations are updated accordingly.
- Reason this comment was not posted: Comment did not seem useful (confidence useful = 0% <= threshold 50%). The comment is asking the PR author to ensure that related documentation and platform configurations are updated, which falls under the rule of not asking the author to ensure things are tested or verified. This is not a specific code suggestion or a request for a test to be written.
Workflow ID: wflow_nw1dA9YKK30uCA8Z
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: CR
Repo: traceloop/openllmetry PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-08-17T15:06:48.109Z
Learning: For debugging OpenTelemetry spans, use ConsoleSpanExporter with Traceloop to print spans to console
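Applied to this repo, that learning looks roughly like the following sketch; the `app_name` value is made up, and the exact `Traceloop.init` keyword arguments should be checked against the SDK version in use.

```python
from opentelemetry.sdk.trace.export import ConsoleSpanExporter
from traceloop.sdk import Traceloop

# Print every span to stdout instead of shipping it to a backend, which
# makes it easy to eyeball the gen_ai.* attributes being emitted.
Traceloop.init(app_name="travel-agent-debug", exporter=ConsoleSpanExporter())
```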
🪛 Ruff (0.14.7)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
212-218: Consider moving this statement to an else block
(TRY300)
220-221: Remove exception handler; error is immediately re-raised
(TRY203)
221-221: Use raise without specifying exception name
Remove exception name
(TRY201)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Lint
```python
return {
    "prompt": query,
    "answer": final_completion if final_completion else query,
```
Fallback to query as answer is semantically incorrect.
If final_completion is empty, the code returns the input query as the answer. This conflates the question with the response. Consider returning an empty string or a placeholder like "No completion captured" to preserve data integrity.
Apply this diff:

```diff
-    "answer": final_completion if final_completion else query,
+    "answer": final_completion if final_completion else "No completion captured",
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "answer": final_completion if final_completion else query, | |
| "answer": final_completion if final_completion else "No completion captured", |
🤖 Prompt for AI Agents
In
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
around line 214, the current fallback returns the input query as the answer when
final_completion is empty; change this to return a safe fallback (e.g., an empty
string "" or the placeholder "No completion captured") instead of the query to
avoid conflating question and response, so update the "answer" value to
final_completion if present else "" (or your chosen placeholder) and keep the
rest of the structure unchanged.
Caution
Changes requested ❌
Reviewed 8ea4145 in 2 minutes and 17 seconds. Click for details.
- Reviewed 79 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 2 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:26
- Draft comment: Using sys.path modification instead of dynamic module loading simplifies import but consider using a proper package structure to avoid side effects.
- Reason this comment was not posted: Confidence changes required: 33% <= threshold 50%.
2. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:112
- Draft comment: Removal of the dynamic module assignment in travel_agent_task in favor of the direct import improves clarity.
- Reason this comment was not posted: Confidence changes required: 0% <= threshold 50%.
Workflow ID: wflow_qUQO06CtHHsx3h0A
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
```python
for key, value in attributes.items():
    if key.startswith("gen_ai.prompt."):
        trajectory_prompts_dict[key] = value
```
Removal of key conversion (gen_ai.prompt -> llm.prompts) may conflict with evaluator expectations. Update the docstring or conversion logic if this change is intentional.
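If the conversion is meant to stay, a remap along these lines would restore the `llm.*` prefix; this is a sketch assuming a plain prefix substitution is all the evaluators need:

```python
def remap_gen_ai_keys(attributes: dict) -> dict:
    """Remap gen_ai.prompt.* / gen_ai.completion.* keys to the llm.* prefix.

    Keys without a matching gen_ai prompt/completion prefix are dropped,
    which fits the trajectory dicts built in the snippet above.
    """
    prefix_map = {
        "gen_ai.prompt.": "llm.prompts.",
        "gen_ai.completion.": "llm.completions.",
    }
    remapped = {}
    for key, value in attributes.items():
        for old_prefix, new_prefix in prefix_map.items():
            if key.startswith(old_prefix):
                remapped[new_prefix + key[len(old_prefix):]] = value
                break
    return remapped
```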
Actionable comments posted: 0
♻️ Duplicate comments (4)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (4)
105-111: Ensure returned task fields match documented "Required fields" and planned evaluators

Your docstring lists `tool_calls` as a required field, and you're already computing tool calls (plus tool inputs/outputs) in `extract_trajectory_from_spans`, but the task result currently omits them. The diff above adds `tool_calls`; if you intend to plug in `agent_tool_error_detector` later, consider also returning `tool_inputs` and `tool_outputs` from `trajectory_data` so the evaluator has everything it needs.

Also applies to: 182-197
269-278: Result summary uses wrong keys and doesn't decode JSON trajectories

In the sample results, you read `prompts`/`completions`, but `travel_agent_task` returns `trajectory_prompts`/`trajectory_completions` as JSON strings. That will always fall back to `''` and counts won't reflect the captured data. You can decode the JSON and count elements instead:

```diff
-        task_result = result.task_result
-        print(f"   - Query: {task_result.get('prompt', 'N/A')[:100]}...")
-        print(f"   - Tools used: {len(task_result.get('tool_calls', []))}")
-        print(f"   - Prompts captured: {len(task_result.get('prompts', '').split('---'))}")
-        print(f"   - Completions captured: {len(task_result.get('completions', '').split('---'))}")
+        task_result = result.task_result
+        print(f"   - Query: {task_result.get('prompt', 'N/A')[:100]}...")
+        print(f"   - Tools used: {len(task_result.get('tool_calls', []))}")
+        prompts_str = task_result.get("trajectory_prompts", "[]")
+        completions_str = task_result.get("trajectory_completions", "[]")
+        try:
+            prompts_data = json.loads(prompts_str)
+        except (TypeError, json.JSONDecodeError):
+            prompts_data = []
+        try:
+            completions_data = json.loads(completions_str)
+        except (TypeError, json.JSONDecodeError):
+            completions_data = []
+        print(f"   - Prompts captured: {len(prompts_data)}")
+        print(f"   - Completions captured: {len(completions_data)}")
```

This keeps the summary aligned with the actual task output format.
1-10: Evaluator list is inconsistent (docs/prints say 5, code runs 4)

Module and function docstrings plus console output describe five evaluators including "Agent Tool Error Detector", but `evaluators` only contains four (`agent_goal_accuracy`, `agent_flow_quality`, `agent_efficiency`, `agent_goal_completeness`). Either add `EvaluatorMadeByTraceloop.agent_tool_error_detector(...)` (and wire the required fields into `travel_agent_task`) or update all three locations to describe only the four actually used, to avoid confusing users of the example. A sketch of a five-evaluator list follows below.

Also applies to: 203-224
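If the intent is to run all five, the wiring would look roughly like this; the constructor arguments shown (especially for `agent_tool_error_detector`) are assumptions, so check each evaluator's `required_input_fields` before copying:

```python
evaluators = [
    EvaluatorMadeByTraceloop.agent_goal_accuracy(threshold=0.7),
    EvaluatorMadeByTraceloop.agent_flow_quality(conditions=["books flights before hotels"]),
    EvaluatorMadeByTraceloop.agent_efficiency(threshold=0.7),
    EvaluatorMadeByTraceloop.agent_goal_completeness(threshold=0.7),
    # Assumed call shape; the real parameters (and required task fields such
    # as tool_inputs/tool_outputs) must match the evaluator's definition.
    EvaluatorMadeByTraceloop.agent_tool_error_detector(),
]
```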
129-197: Clean up trajectory encoding, answer fallback, and outer `try`/`except` in `travel_agent_task`

This block has a few issues: double-JSON-encoding empty trajectories, using the query text as a fallback "answer", and a `try`/`except` that just re-raises the same exception (flagged by Ruff). The example below normalizes empties without double-encoding, keeps attribute counts correct, returns a safer fallback answer, includes `tool_calls` in the task result, and removes the useless `try`/`except`:

```diff
-    try:
-        # Run the travel agent query
-        print(f"\n{'='*80}")
-        print(f"Running travel agent for query: {query}")
-        print(f"{'='*80}\n")
-
-        tool_calls_made = await run_travel_query(query)
-
-        # Get all captured spans
-        spans = exporter.get_finished_spans()
-
-        print(f"\n{'='*80}")
-        print(f"Captured {len(spans)} spans from travel agent execution")
-        print(f"{'='*80}\n")
-
-        # Extract trajectory from spans
-        trajectory_data = extract_trajectory_from_spans(spans)
-
-        # Get the final completion from llm.completions dict
-        completions_dict = trajectory_data["trajectory_completions"]
-        final_completion = ""
-        if completions_dict:
-            # Find the highest index completion content
-            max_idx = -1
-            for key in completions_dict.keys():
-                if ".content" in key:
-                    try:
-                        parts = key.split(".")
-                        idx = int(parts[2])
-                        if idx > max_idx:
-                            max_idx = idx
-                            final_completion = completions_dict[key]
-                    except (ValueError, IndexError):
-                        pass
-
-        # trajectory_prompts and trajectory_completions are dicts with llm.prompts/completions.* keys
-        # If empty, use JSON string fallback to avoid validation errors
-        trajectory_prompts = trajectory_data["trajectory_prompts"]
-        trajectory_completions = trajectory_data["trajectory_completions"]
-
-        # Convert to JSON strings if empty (evaluators expect string when no data)
-        if not trajectory_prompts:
-            trajectory_prompts = json.dumps([])
-        if not trajectory_completions:
-            trajectory_completions = json.dumps([])
-
-        print("📊 Trajectory Summary:")
-        print(f"   - Prompt attributes captured: {len(trajectory_prompts)}")
-        print(f"   - Completion attributes captured: {len(trajectory_completions)}")
-        tools_called = ', '.join(trajectory_data['tool_calls']) if trajectory_data['tool_calls'] else 'None'
-        print(f"   - Tools called: {tools_called}")
-        print(f"   - Tools from run: {', '.join(tool_calls_made) if tool_calls_made else 'None'}\n")
-
-        # Return data using field synonyms that map to required evaluator fields
-        # - "prompt" maps to "question"
-        # - "answer" maps to "completion"
-        # - "context" maps to "reference"
-        # - "trajectory_prompts" and "trajectory_completions" as JSON strings
-
-        json_trajectory_prompts = json.dumps(trajectory_prompts)
-        json_trajectory_completions = json.dumps(trajectory_completions)
-
-        return {
-            "prompt": query,
-            "answer": final_completion if final_completion else query,
-            "context": f"The agent should create a complete travel itinerary for: {query}",
-            "trajectory_prompts": json_trajectory_prompts,
-            "trajectory_completions": json_trajectory_completions,
-        }
-
-    except Exception as e:
-        raise e
+    # Run the travel agent query
+    print(f"\n{'='*80}")
+    print(f"Running travel agent for query: {query}")
+    print(f"{'='*80}\n")
+
+    tool_calls_made = await run_travel_query(query)
+
+    # Get all captured spans
+    spans = exporter.get_finished_spans()
+
+    print(f"\n{'='*80}")
+    print(f"Captured {len(spans)} spans from travel agent execution")
+    print(f"{'='*80}\n")
+
+    # Extract trajectory from spans
+    trajectory_data = extract_trajectory_from_spans(spans)
+
+    # Get the final completion from llm.completions dict
+    completions_dict = trajectory_data["trajectory_completions"]
+    final_completion = ""
+    if completions_dict:
+        # Find the highest index completion content
+        max_idx = -1
+        for key in completions_dict.keys():
+            if ".content" in key:
+                try:
+                    parts = key.split(".")
+                    idx = int(parts[2])
+                    if idx > max_idx:
+                        max_idx = idx
+                        final_completion = completions_dict[key]
+                except (ValueError, IndexError):
+                    pass
+
+    # trajectory_prompts and trajectory_completions are dicts with llm.prompts/completions.* keys
+    trajectory_prompts = trajectory_data["trajectory_prompts"]
+    trajectory_completions = trajectory_data["trajectory_completions"]
+
+    prompt_attr_count = len(trajectory_prompts)
+    completion_attr_count = len(trajectory_completions)
+
+    # Normalise empty trajectories so we don't double-encode JSON
+    if not trajectory_prompts:
+        trajectory_prompts = []
+    if not trajectory_completions:
+        trajectory_completions = []
+
+    print("📊 Trajectory Summary:")
+    print(f"   - Prompt attributes captured: {prompt_attr_count}")
+    print(f"   - Completion attributes captured: {completion_attr_count}")
+    tools_called = ', '.join(trajectory_data['tool_calls']) if trajectory_data['tool_calls'] else 'None'
+    print(f"   - Tools called: {tools_called}")
+    print(f"   - Tools from run: {', '.join(tool_calls_made) if tool_calls_made else 'None'}\n")
+
+    # Return data using field synonyms that map to required evaluator fields
+    # - "prompt" maps to "question"
+    # - "answer" maps to "completion"
+    # - "context" maps to "reference"
+    # - "trajectory_prompts" and "trajectory_completions" as JSON strings
+
+    json_trajectory_prompts = json.dumps(trajectory_prompts)
+    json_trajectory_completions = json.dumps(trajectory_completions)
+
+    return {
+        "prompt": query,
+        "answer": final_completion if final_completion else "No completion captured",
+        "context": f"The agent should create a complete travel itinerary for: {query}",
+        "trajectory_prompts": json_trajectory_prompts,
+        "trajectory_completions": json_trajectory_completions,
+        "tool_calls": trajectory_data.get("tool_calls", []),
+    }
```

This should also clear the Ruff TRY201/203/300 and the earlier comments about double-encoding and the semantically odd `answer` fallback.

Also applies to: 199-200
🧹 Nitpick comments (2)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (2)
26-31: Unused `# noqa: E402` directive on `run_travel_query` import

Ruff reports this `noqa` as unused; if E402 is not enabled in your lint config, you can drop the directive to reduce noise. If you do rely on Flake8's E402, consider a dynamic import (e.g., via `importlib`) instead, so the import can live at the top of the file without needing `# noqa`.
37-51: `extract_trajectory_from_spans` docstring doesn't match behavior

The docstring says you convert `gen_ai.prompt.*`/`gen_ai.completion.*` to `llm.prompts.*`/`llm.completions.*`, but the implementation simply filters `gen_ai.*` keys without renaming. Please either (a) actually remap the keys to the `llm.*` prefix, or (b) update the docstring/inline comment so expectations for downstream evaluators are accurate.

Also applies to: 61-66
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
🧬 Code graph analysis (1)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (5)
- packages/traceloop-sdk/traceloop/sdk/__init__.py (2): Traceloop (36-267), init (48-198)
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (3): agent_flow_quality (569-596), agent_efficiency (599-618), agent_goal_completeness (621-640)
- packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (2): InMemorySpanExporter (22-61), get_finished_spans (40-43)
- packages/traceloop-sdk/traceloop/sdk/tracing/tracing.py (1): TracerWrapper (63-227)
- packages/sample-app/sample_app/agents/travel_agent_example.py (1): run_travel_query (700-719)
🪛 Ruff (0.14.7)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
31-31: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
191-197: Consider moving this statement to an else block
(TRY300)
199-200: Remove exception handler; error is immediately re-raised
(TRY203)
200-200: Use raise without specifying exception name
Remove exception name
(TRY201)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Lint
- GitHub Check: Test Packages (3.11)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Build Packages (3.11)
Important
Looks good to me! 👍
Reviewed 669c6f7 in 49 seconds. Click for details.
- Reviewed 22 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 2 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:64
- Draft comment: Removed extra trailing whitespace on the assignment to trajectory_completions_dict. This improves code consistency.
- Reason this comment was not posted: Confidence changes required: 0% <= threshold 50%.
2. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:248
- Draft comment: Updated experiment_slug from 'travel-agent-exp-3' to 'travel-agent-exp' for consistency and to fix potential naming issues.
- Reason this comment was not posted: Confidence changes required: 0% <= threshold 50%.
Workflow ID: wflow_ruck5OesSDuLF7rf
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 0
♻️ Duplicate comments (7)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (7)
1-14: Docstring lists 5 evaluators but only 4 are instantiated.

The module docstring references "Agent Tool Error Detector" at line 7, but the evaluators list at lines 228-236 only instantiates four evaluators (missing `agent_tool_error_detector`). This has been flagged in previous reviews.
193-193: Fallback to query as answer is semantically incorrect.

If `final_completion` is empty, returning the input `query` as the `answer` conflates the question with the response. This has been flagged in previous reviews.
199-200: Remove unnecessary try/except block.

The try/except at lines 129-200 catches and immediately re-raises without adding logging or context. This has been flagged in previous reviews.
219-224: Printed list claims 5 evaluators but only 4 are instantiated.

The console output lists "Agent Tool Error Detector" as evaluator #2, but it's not instantiated in the evaluators list at lines 228-236. This has been flagged in previous reviews.
277-278: Field name mismatch in result inspection.

Lines 277-278 access `task_result.get('prompts', '')` and `task_result.get('completions', '')`, but `travel_agent_task` returns keys named `trajectory_prompts` and `trajectory_completions` (lines 195-196). This has been flagged in previous reviews.
166-173: Avoid double JSON encoding for empty trajectories.

When `trajectory_prompts` or `trajectory_completions` are empty dicts, lines 171 and 173 convert them to JSON strings (`json.dumps([])`). Then lines 188-189 call `json.dumps()` again on these strings, resulting in double-encoded JSON strings like `"\"[]\""` instead of `"[]"`.

Apply this diff to initialize with empty lists instead:

```diff
-    # Convert to JSON strings if empty (evaluators expect string when no data)
+    # Initialize empty dicts as empty lists for consistent JSON encoding
     if not trajectory_prompts:
-        trajectory_prompts = json.dumps([])
+        trajectory_prompts = []
     if not trajectory_completions:
-        trajectory_completions = json.dumps([])
+        trajectory_completions = []
```
37-47: Docstring incorrectly claims format conversion.

The docstring at line 40 states "Converts gen_ai.prompt.* to llm.prompts.* format expected by evaluators" and line 46 claims the return dict has "llm.prompts.* keys", but the code at line 63 stores keys unchanged (still as `gen_ai.prompt.*`). No conversion occurs.

Apply this diff to correct the docstring:

```diff
 def extract_trajectory_from_spans(spans):
     """
     Extract prompt and completion trajectory from OpenTelemetry spans.
-    Converts gen_ai.prompt.* to llm.prompts.* format expected by evaluators.

     Args:
         spans: List of ReadableSpan objects from InMemorySpanExporter

     Returns:
-        dict with trajectory_prompts (as dict with llm.prompts.* keys), trajectory_completions, and tool_calls
+        dict with trajectory_prompts (as dict with gen_ai.prompt.* keys), trajectory_completions (as dict with gen_ai.completion.* keys), and tool_calls
     """
```
🧹 Nitpick comments (1)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1)
31-31: Remove unused `noqa` directive.

The `noqa: E402` comment is unnecessary as the E402 rule is not enabled in your linting configuration.

Apply this diff:

```diff
-from travel_agent_example import run_travel_query  # noqa: E402
+from travel_agent_example import run_travel_query
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
🪛 Ruff (0.14.7)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
31-31: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
191-197: Consider moving this statement to an else block
(TRY300)
199-200: Remove exception handler; error is immediately re-raised
(TRY203)
200-200: Use raise without specifying exception name
Remove exception name
(TRY201)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Lint
🔇 Additional comments (1)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1)
283-289: LGTM!

The main block correctly sets up and executes the experiment with clear informational output.
Caution
Changes requested ❌
Reviewed 9b03e5a in 2 minutes and 52 seconds. Click for details.
- Reviewed 379 lines of code in 2 files
- Skipped 0 files when reviewing
- Skipped posting 3 draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:56
- Draft comment: Using run_with_span_capture cleans up the span extraction logic. Consider adding a note in the docstring that it returns JSON-encoded trajectories to clarify the expected format downstream.
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 20% vs. threshold = 50%). The comment is suggesting a documentation improvement rather than pointing out a bug or required code change. According to the rules, comments should only be made if there is "clearly a code change required" and should not be "obvious or unimportant." This comment is about adding clarifying documentation, which is somewhat subjective. The comment also makes an assumption about what `run_with_span_capture` returns (JSON-encoded trajectories) without me being able to verify this from the diff. The old code had explicit JSON encoding, but the new code just assigns the return values directly to dictionary fields. Without seeing the implementation of `run_with_span_capture`, I cannot confirm if the comment's premise is correct. The comment is also somewhat speculative about what would be helpful "downstream." The comment might be helpful if the return format is indeed unclear and causes confusion. Documentation improvements can prevent future bugs. However, I cannot verify from the diff alone whether `run_with_span_capture` returns JSON strings or dictionaries, so the comment might be based on incorrect assumptions. While documentation can be helpful, this comment doesn't point to a clear code issue. It's a suggestion for a minor documentation enhancement that may or may not be necessary depending on how clear the `run_with_span_capture` function's own documentation is. The rules state to only keep comments where there's "STRONG EVIDENCE that the comment is correct" and this requires cross-file context to verify. This comment suggests a documentation improvement but doesn't identify a clear code defect. It requires understanding the implementation of `run_with_span_capture` (which is in another file) to verify if it's even correct. Since it's not pointing to a required code change and understanding it requires cross-file context, it should be deleted per the rules.
2. packages/traceloop-sdk/traceloop/sdk/experiment/utils.py:106
- Draft comment: Consider using a logging framework instead of print statements for better production logging and control.
- Reason this comment was not posted: Confidence changes required: 50% <= threshold 50%.
3. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:59
- Draft comment: Typo: In the comment, consider changing "This is the agents input" to "This is the agent's input" to correctly use the possessive form.
- Reason this comment was not posted: Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 10% vs. threshold = 50%). This is a pure grammar/typo fix in a code comment. According to the rules, I should "Do NOT comment unless there is clearly a code change required" and "Do NOT make comments that are obvious or unimportant." A typo in a comment doesn't require a code change - the code functions identically. This is an extremely minor issue that doesn't affect functionality, readability in any meaningful way, or code quality. While the grammar correction is technically accurate, it falls into the category of "obvious or unimportant" comments. The comment is perfectly understandable as-is, and fixing this typo provides minimal value. Could this be considered important for code quality and professionalism? Some teams have strict standards about grammar in comments, and maintaining high-quality documentation could be valuable for long-term maintainability. While documentation quality can be important, this is an inline comment explaining a function parameter, not user-facing documentation. The meaning is completely clear even with the minor grammatical error. The rules explicitly state to avoid "obvious or unimportant" comments, and a possessive apostrophe in an inline comment clearly falls into this category. This comment should be deleted. It's a minor grammatical correction in an inline code comment that doesn't affect functionality, clarity, or code quality in any meaningful way. It violates the rule about not making obvious or unimportant comments.
Workflow ID: wflow_2cqqOe16JNMxUDXE
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 1
♻️ Duplicate comments (3)
packages/traceloop-sdk/traceloop/sdk/experiment/utils.py (1)
141-160: Fix empty-trajectory handling to avoid double JSON encoding and incorrect counts.

Current flow for empty trajectories:
- `trajectory_prompts`/`trajectory_completions` start as `{}`.
- For empties, they're set to `json.dumps([])` (a string `"[]"`).
- Later both are wrapped again with `json.dumps(...)`, yielding `"\"[]\""`. `len(trajectory_prompts)`/`len(trajectory_completions)` then measure string length (2), not attribute counts.

This breaks consumers expecting a single-layer JSON string and makes the summary counts wrong.

Suggested change: keep the internal values as structured objects (dict/list), only JSON-encode once at the end:

```diff
-    trajectory_prompts = trajectory_data["trajectory_prompts"]
-    trajectory_completions = trajectory_data["trajectory_completions"]
-
-    # Convert to JSON strings if empty (evaluators expect string when no data)
-    if not trajectory_prompts:
-        trajectory_prompts = json.dumps([])
-    if not trajectory_completions:
-        trajectory_completions = json.dumps([])
+    trajectory_prompts = trajectory_data["trajectory_prompts"]
+    trajectory_completions = trajectory_data["trajectory_completions"]
+
+    # Use an empty list as the canonical "no trajectory" structure
+    if not trajectory_prompts:
+        trajectory_prompts = []
+    if not trajectory_completions:
+        trajectory_completions = []
@@
-    print(f"   - Prompt attributes captured: {len(trajectory_prompts)}")
-    print(f"   - Completion attributes captured: {len(trajectory_completions)}")
+    print(f"   - Prompt attributes captured: {len(trajectory_prompts)}")
+    print(f"   - Completion attributes captured: {len(trajectory_completions)}")
@@
-    json_trajectory_prompts = json.dumps(trajectory_prompts)
-    json_trajectory_completions = json.dumps(trajectory_completions)
+    json_trajectory_prompts = json.dumps(trajectory_prompts)
+    json_trajectory_completions = json.dumps(trajectory_completions)
```

This preserves summary semantics and returns a single JSON layer while keeping the earlier reviewer feedback about empty lists in mind.
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (2)
1-14: Docstrings and console text advertise 5 evaluators, but only 4 are instantiated.
- Module and function docstrings, plus the printed description, list “Agent Tool Error Detector” and say there are five evaluators.
- The actual `evaluators` list only includes: `agent_goal_accuracy`, `agent_flow_quality`, `agent_efficiency`, `agent_goal_completeness`.

Please either:
- Add `EvaluatorMadeByTraceloop.agent_tool_error_detector(...)` to the list (and ensure the task output exposes whatever `tool_input`/`tool_output` fields that evaluator requires), or
- Update all docstrings and printed descriptions to only mention the four evaluators you actually run.
Right now the example is misleading for users following the documented behavior.
Also applies to: 71-93, 95-104
35-68: Avoid using the input query as the fallback "answer".

In `travel_agent_task`, the returned payload uses:

```python
"answer": final_completion if final_completion else query,
```

When no completion is captured, this reports the original question as the answer, which corrupts downstream evaluation (e.g., goal accuracy, flow quality) and makes debugging harder.

Prefer a neutral or explicit fallback, e.g.:

```diff
-    "answer": final_completion if final_completion else query,
+    "answer": final_completion if final_completion else "No completion captured",
```

(or an empty string / `None`, depending on how the evaluators handle missing completions).
🧹 Nitpick comments (1)
packages/traceloop-sdk/traceloop/sdk/experiment/utils.py (1)
163-164: Remove no-op `try`/`except` wrapper around the function body.

The `except Exception as e: raise e` adds no value and triggers Ruff TRY201/TRY203/TRY300 hints. You can safely drop the outer `try`/`except` and let exceptions propagate with their original traceback:

```diff
-    try:
-        # Run the task callable
+    # Run the task callable
@@
-        json_trajectory_completions = json.dumps(trajectory_completions)
-
-        return json_trajectory_prompts, json_trajectory_completions, final_completion
-
-    except Exception as e:
-        raise e
+    json_trajectory_completions = json.dumps(trajectory_completions)
+
+    return json_trajectory_prompts, json_trajectory_completions, final_completion
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1 hunks)
- packages/traceloop-sdk/traceloop/sdk/experiment/utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
- packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
- packages/traceloop-sdk/traceloop/sdk/experiment/utils.py
🧬 Code graph analysis (2)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (3)
- packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py (4): agent_goal_accuracy (413-433), agent_flow_quality (569-596), agent_efficiency (599-618), agent_goal_completeness (621-640)
- packages/traceloop-sdk/traceloop/sdk/experiment/utils.py (1): run_with_span_capture (68-164)
- packages/sample-app/sample_app/agents/travel_agent_example.py (1): run_travel_query (700-719)

packages/traceloop-sdk/traceloop/sdk/experiment/utils.py (3)
- packages/traceloop-sdk/traceloop/sdk/__init__.py (2): Traceloop (36-267), init (48-198)
- packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py (2): InMemorySpanExporter (22-61), get_finished_spans (40-43)
- packages/traceloop-sdk/traceloop/sdk/tracing/tracing.py (1): TracerWrapper (63-227)
🪛 Ruff (0.14.7)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
29-29: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
packages/traceloop-sdk/traceloop/sdk/experiment/utils.py
161-161: Consider moving this statement to an else block
(TRY300)
163-164: Remove exception handler; error is immediately re-raised
(TRY203)
164-164: Use raise without specifying exception name
Remove exception name
(TRY201)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Lint
Important
Looks good to me! 👍
Reviewed 140bb3a in 53 seconds. Click for details.
- Reviewed `15` lines of code in `1` files
- Skipped `0` files when reviewing.
- Skipped posting `2` draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:57
- Draft comment: Minor whitespace reformatting for clarity. No logic or performance issues.
- Reason this comment was not posted: Confidence changes required: `0%` <= threshold `50%`
2. packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py:59
- Draft comment: Typo: The comment 'This is the agents input' should likely be "This is the agent's input" (with an apostrophe) for grammatical correctness.
- Reason this comment was not posted: Comment was on unchanged code.
Workflow ID: wflow_gkEiIegWgGZNwqFw
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 0
♻️ Duplicate comments (4)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (4)
1-14: Docstring lists 5 evaluators but only 4 are implemented.

The module docstring references "Agent Tool Error Detector" on line 7, but the evaluators list at lines 96-104 only instantiates four evaluators (agent_goal_accuracy, agent_flow_quality, agent_efficiency, agent_goal_completeness). Update the docstring to remove line 7 or add the missing evaluator to lines 96-104.
62-68: Fallback to query as answer is semantically incorrect.

On line 64, when `final_completion` is empty, the code returns the input `query` as the `answer`. This conflates the question with the response. Use an empty string or a placeholder like `"No completion captured"` instead.

Apply this diff:

```diff
     return {
         "prompt": query,
-        "answer": final_completion if final_completion else query,
+        "answer": final_completion if final_completion else "No completion captured",
         "context": f"The agent should create a complete travel itinerary for: {query}",
         "trajectory_prompts": trajectory_prompts,
         "trajectory_completions": trajectory_completions,
     }
```
73-81: Function docstring lists 5 evaluators but only 4 are implemented.

The docstring references "Agent Tool Error Detector" in item 2, but the evaluators list at lines 96-104 only instantiates four evaluators. Update the docstring to match the actual implementation.
88-92: Print statements list 5 evaluators but only 4 are implemented.

Lines 88-92 print a list that includes "Agent Tool Error Detector" on line 89, but the evaluators list at lines 96-104 only instantiates four evaluators. Update the print statements to match the actual implementation or add the missing evaluator.
🧹 Nitpick comments (1)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py (1)
29-29: Remove unused `noqa` directive.

Ruff reports that the `noqa: E402` directive is unused. Remove it to clean up the code.

Apply this diff:

```diff
-from travel_agent_example import run_travel_query  # noqa: E402
+from travel_agent_example import run_travel_query
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- `packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py` (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
- `packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py`
🧬 Code graph analysis (1)
`packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py` (3)

- `packages/traceloop-sdk/traceloop/sdk/evaluator/evaluators_made_by_traceloop.py` (5)
  - `EvaluatorMadeByTraceloop` (5-729)
  - `agent_goal_accuracy` (413-433)
  - `agent_flow_quality` (569-596)
  - `agent_efficiency` (599-618)
  - `agent_goal_completeness` (621-640)
- `packages/traceloop-sdk/traceloop/sdk/experiment/utils.py` (1)
  - `run_with_span_capture` (68-164)
- `packages/sample-app/sample_app/agents/travel_agent_example.py` (1)
  - `run_travel_query` (700-719)
🪛 Ruff (0.14.7)
packages/sample-app/sample_app/experiment/made_by_traceloop/travel_agent_exp.py
29-29: Unused noqa directive (non-enabled: E402)
Remove unused noqa directive
(RUF100)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Lint
- GitHub Check: Test Packages (3.10)
- GitHub Check: Test Packages (3.12)
Caution
Changes requested ❌
Reviewed 1a8b20c in 1 minute and 38 seconds. Click for details.
- Reviewed `13` lines of code in `1` files
- Skipped `0` files when reviewing.
- Skipped posting `0` draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
Workflow ID: wflow_0Zc7lUyWJMBji9uk
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 0
♻️ Duplicate comments (3)
packages/traceloop-sdk/traceloop/sdk/experiment/utils.py (3)
104-105: Remove the no-op try/except that just re-raises the same exception

The `try` block covers almost the entire function, and the `except Exception as e: raise e` handler adds no value (and is what Ruff flags with TRY203/TRY201). Letting exceptions propagate naturally keeps tracebacks intact and simplifies control flow.

You can just drop the `try`/`except` entirely:

```diff
-    try:
-        # Run the task callable
+    # Run the task callable
@@
-        return json_trajectory_prompts, json_trajectory_completions, final_completion
-
-    except Exception as e:
-        raise e
+    return json_trajectory_prompts, json_trajectory_completions, final_completion
```

Also applies to: 163-164
11-21: Align `extract_trajectory_from_spans` docstring with actual behavior and returned fields

The docstring still says gen_ai.* attributes are converted to `llm.prompts.*`/`llm.completions.*` and that only prompts/completions/tool_calls are returned, but the implementation:

- Keeps keys as `gen_ai.prompt.*`/`gen_ai.completion.*` without renaming.
- Also returns `tool_inputs` and `tool_outputs`.

Either update the implementation to actually produce `llm.prompts.*`/`llm.completions.*` and restrict the return shape, or (probably simpler here) update the docstring to describe that gen_ai.* keys are passed through unchanged and that `tool_calls`, `tool_inputs`, and `tool_outputs` are part of the contract.

Also applies to: 59-65
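A docstring along these lines would match the behavior described above (a sketch only; the field names are taken from this review, not re-verified against the module):

```python
def extract_trajectory_from_spans(spans):
    """Extract trajectory data from finished OTEL spans.

    gen_ai.prompt.* and gen_ai.completion.* attribute keys are passed
    through unchanged (no renaming to llm.prompts.*/llm.completions.*).

    Returns a dict with: trajectory_prompts, trajectory_completions,
    tool_calls, tool_inputs, and tool_outputs.
    """
```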
139-160: Fix double JSON-encoding of empty trajectories and incorrect attribute counts

When there are no prompts/completions, `trajectory_prompts`/`trajectory_completions` are set to `json.dumps([])` and then wrapped again with `json.dumps(...)`, producing `"\"[]\""` instead of `"[]"`. It also makes the "Prompt attributes captured" count show `len("[]") == 2` instead of 0.

You can avoid both issues and still return empty arrays as JSON like this:

```diff
-    # trajectory_prompts and trajectory_completions are dicts with llm.prompts/completions.* keys
-    # If empty, use JSON string fallback to avoid validation errors
-    trajectory_prompts = trajectory_data["trajectory_prompts"]
-    trajectory_completions = trajectory_data["trajectory_completions"]
-
-    # Convert to JSON strings if empty (evaluators expect string when no data)
-    if not trajectory_prompts:
-        trajectory_prompts = json.dumps([])
-    if not trajectory_completions:
-        trajectory_completions = json.dumps([])
+    # trajectory_prompts and trajectory_completions are dicts with gen_ai.* keys
+    # Keep them as Python objects for counting; build JSON strings below
+    trajectory_prompts = trajectory_data["trajectory_prompts"]
+    trajectory_completions = trajectory_data["trajectory_completions"]
@@
-    json_trajectory_prompts = json.dumps(trajectory_prompts)
-    json_trajectory_completions = json.dumps(trajectory_completions)
-
-    return json_trajectory_prompts, json_trajectory_completions, final_completion
+    if trajectory_prompts:
+        json_trajectory_prompts = json.dumps(trajectory_prompts)
+    else:
+        json_trajectory_prompts = json.dumps([])
+
+    if trajectory_completions:
+        json_trajectory_completions = json.dumps(trajectory_completions)
+    else:
+        json_trajectory_completions = json.dumps([])
+
+    return json_trajectory_prompts, json_trajectory_completions, final_completion
```

This preserves correct counts and produces `"[]"` in the "no data" case instead of a double-encoded string.
🧹 Nitpick comments (1)
packages/traceloop-sdk/traceloop/sdk/experiment/utils.py (1)
90-103: Be cautious resetting the `TracerWrapper` singleton and reinitializing Traceloop in-process

Deleting `TracerWrapper.instance` and calling `Traceloop.init(...)` unconditionally is fine for a dedicated experiment runner, but in a long-lived process (e.g., an app using Traceloop elsewhere) this will tear down and replace the existing tracer configuration, potentially surprising callers that already initialized tracing.

If this utility is intended strictly for standalone experiments, consider documenting that explicitly; otherwise you may want a code path that:
- Detects an existing initialized tracer and either reuses it, or
- Makes re‑initialization opt‑in via a parameter (see the sketch below).
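A minimal sketch of the opt-in gate (the `reinit_tracer` parameter is hypothetical, and a reused tracer would of course keep its original exporter):

```python
from traceloop.sdk import Traceloop
from traceloop.sdk.tracing.tracing import TracerWrapper
from traceloop.sdk.utils.in_memory_span_exporter import InMemorySpanExporter


def run_with_span_capture(task, *args, reinit_tracer=False, **kwargs):
    exporter = InMemorySpanExporter()
    if reinit_tracer and hasattr(TracerWrapper, "instance"):
        # Tear down the existing tracer only on explicit opt-in
        del TracerWrapper.instance
    if not hasattr(TracerWrapper, "instance"):
        Traceloop.init(exporter=exporter, disable_batch=True)
    result = task(*args, **kwargs)
    return result, exporter.get_finished_spans()
```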
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- `packages/traceloop-sdk/traceloop/sdk/experiment/utils.py` (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules
Files:
- `packages/traceloop-sdk/traceloop/sdk/experiment/utils.py`
🧬 Code graph analysis (1)
`packages/traceloop-sdk/traceloop/sdk/experiment/utils.py` (3)

- `packages/traceloop-sdk/traceloop/sdk/__init__.py` (2)
  - `Traceloop` (36-267)
  - `init` (48-198)
- `packages/traceloop-sdk/traceloop/sdk/utils/in_memory_span_exporter.py` (1)
  - `get_finished_spans` (40-43)
- `packages/traceloop-sdk/traceloop/sdk/tracing/tracing.py` (1)
  - `TracerWrapper` (63-227)
🪛 Ruff (0.14.7)
packages/traceloop-sdk/traceloop/sdk/experiment/utils.py
161-161: Consider moving this statement to an else block
(TRY300)
163-164: Remove exception handler; error is immediately re-raised
(TRY203)
164-164: Use raise without specifying exception name
Remove exception name
(TRY201)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Lint
- GitHub Check: Test Packages (3.11)
- GitHub Check: Build Packages (3.11)
Important
Looks good to me! 👍
Reviewed e72a010 in 39 seconds. Click for details.
- Reviewed `12` lines of code in `1` files
- Skipped `0` files when reviewing.
- Skipped posting `1` draft comments. View those below.
- Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/pyproject.toml:124
- Draft comment: Add a comment explaining why 'traceloop/sdk/experiment/utils.py' is excluded from type checking.
- Reason this comment was not posted: Confidence changes required: `33%` <= threshold `50%`
Workflow ID: wflow_7r0bmYjnwTIvhXb0
You can customize by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.
Actionable comments posted: 0
🧹 Nitpick comments (1)
packages/traceloop-sdk/pyproject.toml (1)
124-124: Reconsider broad module exclusion from type checking in a strict mypy configuration.

The mypy configuration enforces `disallow_untyped_defs = true` (line 105), yet you're excluding an entire new module from type checking. This creates an inconsistency and introduces a potential maintenance burden. The enriched summary mentions "dynamic typing patterns," but excluding a full module is a code smell that suggests untyped code could hide runtime type errors.

Consider one of these alternatives:
- Properly type the module to align with the strict configuration (preferred)
- Use narrower exclusion for specific functions/classes that genuinely require dynamic typing (e.g., using `# type: ignore` comments on specific lines)
- Document why broad exclusion is necessary if it must remain (e.g., in a comment or ADR)
- Add type stubs for the dynamic parts if the code genuinely can't be typed
Can you clarify why the entire `utils.py` module requires exclusion from type checking? If there are specific functions with dynamic patterns that resist typing, we could use more targeted suppression instead of excluding the whole module.
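For illustration, a line-scoped suppression inside `utils.py` might look like this (the helper and the exact error code are hypothetical, invented for the sketch rather than taken from the module):

```python
from opentelemetry.sdk.trace import ReadableSpan


def _span_attributes(span: ReadableSpan) -> dict:
    # Hypothetical helper: span.attributes is Optional and heterogeneous,
    # which is the kind of dynamic spot mypy flags; suppress just this
    # line instead of excluding the whole module from type checking.
    return dict(span.attributes)  # type: ignore[arg-type]
```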
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
- `packages/traceloop-sdk/pyproject.toml` (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: Test Packages (3.11)
- GitHub Check: Test Packages (3.12)
- GitHub Check: Test Packages (3.10)
- GitHub Check: Build Packages (3.11)
- GitHub Check: Lint
Fixes TLP-1261
`feat(instrumentation): ...` or `fix(instrumentation): ...`
Introduces a travel agent experiment with trajectory capture and evaluation using Traceloop's evaluators, and enhances trajectory data extraction utilities.
travel_agent_exp.pyto demonstrate a travel agent experiment using Traceloop's evaluators.extract_trajectory_from_spans()andrun_with_span_capture()inutils.pyfor capturing and extracting trajectory data.agent_flow_quality()inevaluators_made_by_traceloop.pyto acceptthresholdandconditionsfor finer-grained evaluation.This description was created by
for e72a010. You can customize this summary. It will automatically update as commits are pushed.
Summary by CodeRabbit
New Features
Enhancements
✏️ Tip: You can customize this high-level summary in your review settings.